
Replacing employees is costly. Not only are there costs involved with recruiting talented people and getting them up to speed, but losing top talent can be a source of competitive disadvantage. A far better alternative is to retain employees, especially those who are highly specialized or exceptionally high performing.
What if we could reliably predict who is likely to leave the company, before it happens, and design interventions to retain them? Here I will demonstrate building a prediction model designed to identify high flight-risk employees, as well as unpacking what makes people risky in the hope of prescribing an effective retention intervention strategy.
For this example I utilize a public dataset from Kaggle: "IBM HR Analytics Employee Attrition & Performance". It contains information on 1,470 employees, including whether each left the company, along with individual characteristics covering education, job, sentiment, experience, performance, compensation, and life situation.
# import packages and dataset
import seaborn as sns
%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
url = 'https://raw.githubusercontent.com/christianclayton7/ibm-attrition-prediction/main/ibm_attrition_data.csv'
df = pd.read_csv(url)
pd.options.display.max_columns = None
# preview data
df.head()
| | Age | Attrition | BusinessTravel | DailyRate | Department | DistanceFromHome | Education | EducationField | EmployeeCount | EmployeeNumber | EnvironmentSatisfaction | Gender | HourlyRate | JobInvolvement | JobLevel | JobRole | JobSatisfaction | MaritalStatus | MonthlyIncome | MonthlyRate | NumCompaniesWorked | Over18 | OverTime | PercentSalaryHike | PerformanceRating | RelationshipSatisfaction | StandardHours | StockOptionLevel | TotalWorkingYears | TrainingTimesLastYear | WorkLifeBalance | YearsAtCompany | YearsInCurrentRole | YearsSinceLastPromotion | YearsWithCurrManager |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 41 | Yes | Travel_Rarely | 1102 | Sales | 1 | 2 | Life Sciences | 1 | 1 | 2 | Female | 94 | 3 | 2 | Sales Executive | 4 | Single | 5993 | 19479 | 8 | Y | Yes | 11 | 3 | 1 | 80 | 0 | 8 | 0 | 1 | 6 | 4 | 0 | 5 |
| 1 | 49 | No | Travel_Frequently | 279 | Research & Development | 8 | 1 | Life Sciences | 1 | 2 | 3 | Male | 61 | 2 | 2 | Research Scientist | 2 | Married | 5130 | 24907 | 1 | Y | No | 23 | 4 | 4 | 80 | 1 | 10 | 3 | 3 | 10 | 7 | 1 | 7 |
| 2 | 37 | Yes | Travel_Rarely | 1373 | Research & Development | 2 | 2 | Other | 1 | 4 | 4 | Male | 92 | 2 | 1 | Laboratory Technician | 3 | Single | 2090 | 2396 | 6 | Y | Yes | 15 | 3 | 2 | 80 | 0 | 7 | 3 | 3 | 0 | 0 | 0 | 0 |
| 3 | 33 | No | Travel_Frequently | 1392 | Research & Development | 3 | 4 | Life Sciences | 1 | 5 | 4 | Female | 56 | 3 | 1 | Research Scientist | 3 | Married | 2909 | 23159 | 1 | Y | Yes | 11 | 3 | 3 | 80 | 0 | 8 | 3 | 3 | 8 | 7 | 3 | 0 |
| 4 | 27 | No | Travel_Rarely | 591 | Research & Development | 2 | 1 | Medical | 1 | 7 | 1 | Male | 40 | 3 | 1 | Laboratory Technician | 2 | Married | 3468 | 16632 | 9 | Y | No | 12 | 3 | 4 | 80 | 1 | 6 | 3 | 3 | 2 | 2 | 2 | 2 |
There was no missing data that needed to be dropped or filled in. A few features have the same value for all rows (standard hours, employee count, Over 18) and thus contain no useful information for our model, so they can be dropped. Numeric features have varying scales (sentiment features are 1-4, performance is 3 or 4, while monthly income is in the tens of thousands) and amounts of variance, so standardizing them to the same scale before modeling helps with interpretation and is important for some model types.
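As a quick sketch of that cleanup step (on a toy frame, since the pattern is the same for the real data), constant columns can be found programmatically rather than by eye:

```python
import pandas as pd

# toy frame standing in for df: 'EmployeeCount' is constant, 'Age' varies
toy = pd.DataFrame({'EmployeeCount': [1, 1, 1],
                    'Age': [41, 49, 37]})

# columns with a single unique value carry no information and can be dropped
constant_cols = toy.columns[toy.nunique() == 1].tolist()
toy = toy.drop(columns=constant_cols)
```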
The target class (Attrition) is imbalanced: only 16% of the population left in the period in question. This means that if we guessed that everyone stays, we'd be right 84% of the time, but we'd be wrong about every person who did leave. As we evaluate our prediction models we will want to push overall accuracy above 84% and, importantly, correctly predict when someone does leave, a metric known as recall.
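To make that baseline concrete, here is a small sketch (toy labels with the same ~16% leave rate, not the actual dataset) showing why the "everyone stays" guess looks accurate but has zero recall:

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# toy labels with the ~16% leave rate described above (1 = left)
y_true = np.array([0] * 84 + [1] * 16)
y_baseline = np.zeros_like(y_true)   # the 'everyone stays' guess

baseline_accuracy = accuracy_score(y_true, y_baseline)   # looks high
baseline_recall = recall_score(y_true, y_baseline)       # catches no leavers
```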
# look at columns
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1470 entries, 0 to 1469
Data columns (total 35 columns):
 #   Column                    Non-Null Count  Dtype
---  ------                    --------------  -----
 0   Age                       1470 non-null   int64
 1   Attrition                 1470 non-null   object
 2   BusinessTravel            1470 non-null   object
 3   DailyRate                 1470 non-null   int64
 4   Department                1470 non-null   object
 5   DistanceFromHome          1470 non-null   int64
 6   Education                 1470 non-null   int64
 7   EducationField            1470 non-null   object
 8   EmployeeCount             1470 non-null   int64
 9   EmployeeNumber            1470 non-null   int64
 10  EnvironmentSatisfaction   1470 non-null   int64
 11  Gender                    1470 non-null   object
 12  HourlyRate                1470 non-null   int64
 13  JobInvolvement            1470 non-null   int64
 14  JobLevel                  1470 non-null   int64
 15  JobRole                   1470 non-null   object
 16  JobSatisfaction           1470 non-null   int64
 17  MaritalStatus             1470 non-null   object
 18  MonthlyIncome             1470 non-null   int64
 19  MonthlyRate               1470 non-null   int64
 20  NumCompaniesWorked        1470 non-null   int64
 21  Over18                    1470 non-null   object
 22  OverTime                  1470 non-null   object
 23  PercentSalaryHike         1470 non-null   int64
 24  PerformanceRating         1470 non-null   int64
 25  RelationshipSatisfaction  1470 non-null   int64
 26  StandardHours             1470 non-null   int64
 27  StockOptionLevel          1470 non-null   int64
 28  TotalWorkingYears         1470 non-null   int64
 29  TrainingTimesLastYear     1470 non-null   int64
 30  WorkLifeBalance           1470 non-null   int64
 31  YearsAtCompany            1470 non-null   int64
 32  YearsInCurrentRole        1470 non-null   int64
 33  YearsSinceLastPromotion   1470 non-null   int64
 34  YearsWithCurrManager      1470 non-null   int64
dtypes: int64(26), object(9)
memory usage: 402.1+ KB
# look at summary stats of numeric features
pd.options.display.precision = 2
df.describe()
| | Age | DailyRate | DistanceFromHome | Education | EmployeeCount | EmployeeNumber | EnvironmentSatisfaction | HourlyRate | JobInvolvement | JobLevel | JobSatisfaction | MonthlyIncome | MonthlyRate | NumCompaniesWorked | PercentSalaryHike | PerformanceRating | RelationshipSatisfaction | StandardHours | StockOptionLevel | TotalWorkingYears | TrainingTimesLastYear | WorkLifeBalance | YearsAtCompany | YearsInCurrentRole | YearsSinceLastPromotion | YearsWithCurrManager |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 1470.00 | 1470.00 | 1470.00 | 1470.00 | 1470.0 | 1470.00 | 1470.00 | 1470.00 | 1470.00 | 1470.00 | 1470.00 | 1470.00 | 1470.00 | 1470.00 | 1470.00 | 1470.00 | 1470.00 | 1470.0 | 1470.00 | 1470.00 | 1470.00 | 1470.00 | 1470.00 | 1470.00 | 1470.00 | 1470.00 |
| mean | 36.92 | 802.49 | 9.19 | 2.91 | 1.0 | 1024.87 | 2.72 | 65.89 | 2.73 | 2.06 | 2.73 | 6502.93 | 14313.10 | 2.69 | 15.21 | 3.15 | 2.71 | 80.0 | 0.79 | 11.28 | 2.80 | 2.76 | 7.01 | 4.23 | 2.19 | 4.12 |
| std | 9.14 | 403.51 | 8.11 | 1.02 | 0.0 | 602.02 | 1.09 | 20.33 | 0.71 | 1.11 | 1.10 | 4707.96 | 7117.79 | 2.50 | 3.66 | 0.36 | 1.08 | 0.0 | 0.85 | 7.78 | 1.29 | 0.71 | 6.13 | 3.62 | 3.22 | 3.57 |
| min | 18.00 | 102.00 | 1.00 | 1.00 | 1.0 | 1.00 | 1.00 | 30.00 | 1.00 | 1.00 | 1.00 | 1009.00 | 2094.00 | 0.00 | 11.00 | 3.00 | 1.00 | 80.0 | 0.00 | 0.00 | 0.00 | 1.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| 25% | 30.00 | 465.00 | 2.00 | 2.00 | 1.0 | 491.25 | 2.00 | 48.00 | 2.00 | 1.00 | 2.00 | 2911.00 | 8047.00 | 1.00 | 12.00 | 3.00 | 2.00 | 80.0 | 0.00 | 6.00 | 2.00 | 2.00 | 3.00 | 2.00 | 0.00 | 2.00 |
| 50% | 36.00 | 802.00 | 7.00 | 3.00 | 1.0 | 1020.50 | 3.00 | 66.00 | 3.00 | 2.00 | 3.00 | 4919.00 | 14235.50 | 2.00 | 14.00 | 3.00 | 3.00 | 80.0 | 1.00 | 10.00 | 3.00 | 3.00 | 5.00 | 3.00 | 1.00 | 3.00 |
| 75% | 43.00 | 1157.00 | 14.00 | 4.00 | 1.0 | 1555.75 | 4.00 | 83.75 | 3.00 | 3.00 | 4.00 | 8379.00 | 20461.50 | 4.00 | 18.00 | 3.00 | 4.00 | 80.0 | 1.00 | 15.00 | 3.00 | 3.00 | 9.00 | 7.00 | 3.00 | 7.00 |
| max | 60.00 | 1499.00 | 29.00 | 5.00 | 1.0 | 2068.00 | 4.00 | 100.00 | 4.00 | 5.00 | 4.00 | 19999.00 | 26999.00 | 9.00 | 25.00 | 4.00 | 4.00 | 80.0 | 3.00 | 40.00 | 6.00 | 4.00 | 40.00 | 18.00 | 15.00 | 17.00 |
# look at proportion of each level in each category feature
for col in df.select_dtypes(include='object').columns:
print("\n", df[col].value_counts(normalize = True).round(2))
Attrition
No     0.84
Yes    0.16
Name: proportion, dtype: float64

BusinessTravel
Travel_Rarely        0.71
Travel_Frequently    0.19
Non-Travel           0.10
Name: proportion, dtype: float64

Department
Research & Development    0.65
Sales                     0.30
Human Resources           0.04
Name: proportion, dtype: float64

EducationField
Life Sciences       0.41
Medical             0.32
Marketing           0.11
Technical Degree    0.09
Other               0.06
Human Resources     0.02
Name: proportion, dtype: float64

Gender
Male      0.6
Female    0.4
Name: proportion, dtype: float64

JobRole
Sales Executive              0.22
Research Scientist           0.20
Laboratory Technician        0.18
Manufacturing Director       0.10
Healthcare Representative    0.09
Manager                      0.07
Sales Representative         0.06
Research Director            0.05
Human Resources              0.04
Name: proportion, dtype: float64

MaritalStatus
Married     0.46
Single      0.32
Divorced    0.22
Name: proportion, dtype: float64

Over18
Y    1.0
Name: proportion, dtype: float64

OverTime
No     0.72
Yes    0.28
Name: proportion, dtype: float64
Exploring how the numeric features are distributed for leavers vs. stayers reveals several ways in which leavers differ from stayers.
# view distribution of numeric features (excluding ones with very few unique values) by attrition
drop_cols = ['StandardHours','EmployeeCount','EmployeeNumber',
'WorkLifeBalance','RelationshipSatisfaction','JobSatisfaction','JobInvolvement','EnvironmentSatisfaction',
'PerformanceRating', 'StockOptionLevel']
num_cols = df.drop(drop_cols, axis = 1).select_dtypes(include='int64').columns
for col in num_cols:
    sns.catplot(kind = 'boxen', data = df, x = col, y = 'Attrition')
    plt.show()
# view proportion of leavers in each level of ordinal variables (numeric with few unique values)
cat_cols = df[['WorkLifeBalance','RelationshipSatisfaction','JobSatisfaction','JobInvolvement','EnvironmentSatisfaction',
'PerformanceRating', 'StockOptionLevel']].columns
for col in cat_cols:
    temp = df.groupby(col).value_counts(['Attrition'], normalize = True).reset_index()
    sns.barplot(data = temp,
                x = col,
                y = 'proportion',
                hue = 'Attrition')
    plt.show()
When we inspect the category features we find that the proportion of leavers vs. stayers changes quite a bit depending on the category, which indicates a feature will be useful in classifying leavers vs. stayers. Features with a lot of variance between categories will likely be good predictors.
# view proportion of leavers in each level of the category variables
cat_cols = df.drop('Over18', axis = 1).select_dtypes(include='object').columns
cat_cols = cat_cols[cat_cols != 'Attrition']
for col in cat_cols:
    temp = df.groupby(col).value_counts(['Attrition'], normalize = True).reset_index()
    sns.barplot(data = temp,
                x = col,
                y = 'proportion',
                hue = 'Attrition')
    plt.xticks(rotation=90)
    plt.show()
Many of the features seem similar and probably contain redundant information from the perspective of our attrition model.
A correlation plot of our numeric features reveals a lot of correlation between features having to do with experience level (age, job level, total experience, tenure, time in role, time since promotion, time with manager). Rather than dropping them, I will perform principal component analysis (PCA) on these features to preserve their signal for the model while reducing the redundancy of information.
# function for sorting the correlation matrix
import scipy
import scipy.cluster.hierarchy as sch

def cluster_corr(corr_array, inplace=False):
    """
    Rearranges the correlation matrix, corr_array, so that groups of highly
    correlated variables are next to each other.

    Parameters
    ----------
    corr_array : pandas.DataFrame or numpy.ndarray
        an NxN correlation matrix

    Returns
    -------
    pandas.DataFrame or numpy.ndarray
        an NxN correlation matrix with the columns and rows rearranged
    """
    pairwise_distances = sch.distance.pdist(corr_array)
    linkage = sch.linkage(pairwise_distances, method='complete')
    cluster_distance_threshold = pairwise_distances.max() / 2
    idx_to_cluster_array = sch.fcluster(linkage, cluster_distance_threshold,
                                        criterion='distance')
    idx = np.argsort(idx_to_cluster_array)

    if not inplace:
        corr_array = corr_array.copy()

    if isinstance(corr_array, pd.DataFrame):
        return corr_array.iloc[idx, :].T.iloc[idx, :]
    return corr_array[idx, :][:, idx]
# get correlations and sort them
corr = cluster_corr(df.drop(['Over18','StandardHours','EmployeeCount','EmployeeNumber'],axis = 1).select_dtypes(include='int64').corr())
# plot correlations in heatmap
sns.heatmap(corr)
PCA of our highly correlated 'experience level' features from the correlation plot above suggests we can explain about 87% of the variance in these features with only the first 2 principal components; adding more components does not add much information. In other words, we can have 2 features in our model instead of 7 while preserving 87% of the information they contain.
# check for number of significant principal components on year count/experience features
from sklearn.decomposition import PCA
pca = PCA()
experience_pca = pca.fit_transform(df[['Age','JobLevel', 'TotalWorkingYears','YearsAtCompany', 'YearsInCurrentRole',
'YearsSinceLastPromotion','YearsWithCurrManager']])
def cum_pca_var(n_comps, cum_exp_var_ratio):
    plt.plot(range(1, n_comps + 1), cum_exp_var_ratio)
    plt.annotate('cumulative', (n_comps + 1, 1), (n_comps + 1, .9))
    plt.xticks(range(n_comps + 1))
    plt.ylabel('% of variance explained')
    plt.xlabel('Principal component')
    plt.title('Cumulative % of variance explained by additional components')
    plt.show()
cum_pca_var(pca.n_components_,
pca.explained_variance_ratio_.cumsum()
)
Because we have 4 different compensation features in the model, it makes sense to see if we can combine them into fewer. PCA reveals a great candidate for reduction, as 99% of the variance in the compensation features is explained by the first 2 principal components.
# check for number of significant principal components on compensation features
pca = PCA()
comp_pca = pca.fit_transform(df[['DailyRate','HourlyRate','MonthlyIncome', 'MonthlyRate']])
cum_pca_var(pca.n_components_,
pca.explained_variance_ratio_.cumsum()
)
Lastly, we have 5 sentiment features from an employee survey. PCA reveals the first 3 principal components of sentiment contain almost 80% of the variance.
# check for number of significant principal components on sentiment features
pca = PCA()
sent_pca = pca.fit_transform(df[['WorkLifeBalance','RelationshipSatisfaction','JobSatisfaction',
'JobInvolvement','EnvironmentSatisfaction']])
cum_pca_var(pca.n_components_,
pca.explained_variance_ratio_.cumsum()
)
Overall PCA can help us go from 16 features to 7 without losing much information, hopefully decreasing model complexity and increasing predictive signal.
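One caveat: fitting PCA on the full dataset before splitting lets information from the (future) test rows leak into the transformation. A sketch of an alternative (not what the cells below do, and `reducer`/`pipe` are names I'm introducing here) is to wrap the per-group PCA steps in a `ColumnTransformer` inside a `Pipeline`, so they are refit on training data only:

```python
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

experience_cols = ['Age', 'JobLevel', 'TotalWorkingYears', 'YearsAtCompany',
                   'YearsInCurrentRole', 'YearsSinceLastPromotion', 'YearsWithCurrManager']
compensation_cols = ['DailyRate', 'HourlyRate', 'MonthlyIncome', 'MonthlyRate']
sentiment_cols = ['WorkLifeBalance', 'RelationshipSatisfaction', 'JobSatisfaction',
                  'JobInvolvement', 'EnvironmentSatisfaction']

# PCA per feature group; remaining columns pass through untouched
reducer = ColumnTransformer(
    [('experience', PCA(n_components=2), experience_cols),
     ('compensation', PCA(n_components=2), compensation_cols),
     ('sentiment', PCA(n_components=3), sentiment_cols)],
    remainder='passthrough')

pipe = Pipeline([('reduce', reducer),
                 ('scale', StandardScaler()),
                 ('model', LogisticRegression())])
# pipe.fit(X_train, y_train) would then fit PCA and scaling on training data only
```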
# get 2 principal components of the experience features
pca = PCA(n_components = 2)
experience_pca = pca.fit_transform(df[['Age','JobLevel', 'TotalWorkingYears','YearsAtCompany', 'YearsInCurrentRole',
'YearsSinceLastPromotion','YearsWithCurrManager']])
experience_pca = pd.DataFrame(experience_pca)
experience_pca.columns = ['Experience1', 'Experience2']
# get 2 principal components of the compensation features
pca = PCA(n_components = 2)
compensation_pca = pca.fit_transform(df[['DailyRate','HourlyRate','MonthlyIncome', 'MonthlyRate']])
compensation_pca = pd.DataFrame(compensation_pca)
compensation_pca.columns = ['Compensation1', 'Compensation2']
# get 3 principal components of the sentiment features
pca = PCA(n_components = 3)
sentiment_pca = pca.fit_transform(df[['WorkLifeBalance','RelationshipSatisfaction','JobSatisfaction',
'JobInvolvement','EnvironmentSatisfaction']])
sentiment_pca = pd.DataFrame(sentiment_pca)
sentiment_pca.columns = ['Sentiment1', 'Sentiment2', 'Sentiment3']
# turn categorical features into 1's and 0's for sklearn to be able to interpret and use
df_cats = pd.get_dummies(df.drop(['Over18'], axis = 1).select_dtypes(include='object'), drop_first = True)
# combine pca features, categorical dummy features, and remaining useful numeric features
drop_cols = ['StandardHours','EmployeeCount','EmployeeNumber', 'Over18',
'WorkLifeBalance','RelationshipSatisfaction','JobSatisfaction','JobInvolvement','EnvironmentSatisfaction',
'DailyRate','HourlyRate','MonthlyIncome', 'MonthlyRate',
'Age','JobLevel', 'TotalWorkingYears','YearsAtCompany', 'YearsInCurrentRole',
'YearsSinceLastPromotion','YearsWithCurrManager',
'Attrition',
'BusinessTravel', 'Department', 'EducationField', 'Gender', 'JobRole', 'MaritalStatus', 'OverTime']
# combine dataframes
df2 = pd.concat([df.drop(drop_cols, axis = 1),
df_cats,
experience_pca,
compensation_pca,
sentiment_pca],
axis = 1)
# split into training and test sets
from sklearn.model_selection import train_test_split
X = df2.drop('Attrition_Yes', axis = 1).values
y = df2['Attrition_Yes'].values
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 7, test_size = .3, stratify = y)
# standardize features
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)  # transform only: reuse the scaling fit on the training data
Next, we evaluate the following model types on cross-validated training accuracy and recall, then grid-search the winning model type and evaluate it on the test set.
# get model packages
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score, KFold
# set up dictionary of model types to estimate
models = {'KNN':KNeighborsClassifier(n_neighbors = 5),
'Logit':LogisticRegression(),
'RF':RandomForestClassifier(n_estimators = 80, min_samples_leaf = 5, random_state = 7),
'GBM':GradientBoostingClassifier(n_estimators = 80, max_depth = 3, subsample = .8, max_features = .2, random_state = 7)}
# set up empty list to catch CV training scores
results_accuracy = []
results_recall = []
# loop through CV scoring on each model type
for model in models.values():
    kf = KFold(n_splits = 6, random_state = 7, shuffle = True)
    cv_results_accuracy = cross_val_score(model, X_train_scaled, y_train, cv = kf)
    cv_results_recall = cross_val_score(model, X_train_scaled, y_train, cv = kf, scoring = 'recall')
    results_accuracy.append(cv_results_accuracy)
    results_recall.append(cv_results_recall)
# plot cv scores
plt.boxplot(results_accuracy, labels = models.keys())
plt.title('Total CV Accuracy')
plt.show()
# plot cv scores
plt.boxplot(results_recall, labels = models.keys())
plt.title('Total CV Recall (Accuracy on actual leavers)')
plt.show()
When evaluating our four model types we use overall accuracy and recall. Recall is of particular importance because it measures accuracy on the observations where employees actually left. Theoretically, we want the model to correctly sort flight risks from non-flight-risks; however, the cost of losing an employee because we didn't identify them as a flight risk is probably higher than the cost of incorrectly labeling a non-flight-risk as one. Focusing on recall means we are prioritizing model accuracy for those who quit.
As we can see, both the logit model and the gradient boosting machine perform well overall, but the logit does better at correctly identifying leavers specifically. It should be noted that the median CV recall score is about 38%, meaning the best model type here still only correctly classifies leavers 38% of the time, which leaves plenty of room for improvement.
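To make the recall arithmetic concrete, here is a small worked example on hypothetical labels and predictions (illustrative numbers, not output from the models above), using the confusion matrix that recall is computed from:

```python
import numpy as np
from sklearn.metrics import confusion_matrix, recall_score

# hypothetical labels and predictions (1 = left); illustrative only
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 0, 0, 0, 0, 0, 0, 0, 1])

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
# recall = TP / (TP + FN): the share of actual leavers the model caught
recall = tp / (tp + fn)   # here 2 / (2 + 2) = 0.5
```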
Let's see if we can tune our logit model to perform even better on recall.
from sklearn.model_selection import GridSearchCV
kf = KFold(n_splits = 5, shuffle = True, random_state = 7)
param_grid = {'penalty' : ['l1', 'l2'],
'C' : np.logspace(-4, 4, 5),
'solver' : ['liblinear']}
logit = LogisticRegression()
logit_cv = GridSearchCV(logit, param_grid, cv = kf, scoring = 'recall')
logit_cv.fit(X_train_scaled, y_train)
print('top tuned parameters:', logit_cv.best_params_)
print('top tuned recall score:', round(logit_cv.best_score_,2))
top tuned parameters: {'C': 0.0001, 'penalty': 'l2', 'solver': 'liblinear'}
top tuned recall score: 0.4
After performing a gridsearch to test out different hyperparameter combinations for the best performing model on recall, we settle in on the above hyperparameters with an average CV recall score on the training set of about 40% (a modest improvement on the base logit model).
Now let's get our best model and evaluate it on our test data the model has never seen.
from sklearn.metrics import recall_score
logit_best = LogisticRegression(penalty = 'l2', C = .0001, solver = 'liblinear')
logit_best.fit(X_train_scaled, y_train)
y_pred = logit_best.predict(X_test_scaled)
print('test data recall score:', round(recall_score(y_test, y_pred), 2))
print('test data overall accuracy', round(logit_best.score(X_test_scaled, y_test),2))
test data recall score: 0.38
test data overall accuracy 0.86
On test data the model has never seen, our best model correctly classified leavers 38% of the time and correctly classified all employees 86% of the time. Recalling that 84% of the overall dataset are non-leavers, our model is a modest improvement over an 'assume everyone stays' guessing strategy in overall accuracy and a significant improvement with respect to leavers (recall).
The plot below shows the magnitudes of feature influences on the odds of attrition: red bars increase attrition odds and blue bars decrease them. For example, working overtime had the single biggest influence on attrition of any factor in our model, and it increased the odds of leaving. Some of these factors can be influenced and some cannot. We cannot influence an employee to get married, and we cannot magically give an employee more years of experience, but we can cap their hours so they don't work overtime, and we can increase their pay.
Consequently, our model suggests a set of retention levers: actions that could help lower attrition odds for an individual you hope to retain.
One puzzling factor is sentiment. Theoretically you'd think higher sentiment would lower attrition odds, but our model suggests the opposite. It should be noted that these are independent effects: this is sentiment after removing any negative influence that overtime work, low pay, low stock options, long commutes, travel for work, and the specific challenges of a given department or job role have on it. A follow-up analysis would be useful here to figure out what is driving this dynamic.
# inspect regression coefficients
coefs_df = pd.DataFrame({'Feature':list(df2.drop('Attrition_Yes', axis = 1).columns),
'Coef':logit_best.coef_[0]}).sort_values('Coef',ascending = False)
coefs_df['Coef (abs)'] = coefs_df['Coef'].abs()
coefs_df = coefs_df.sort_values('Coef (abs)', ascending = False)
coefs_df['Attrition Odds Change'] = np.where(coefs_df['Coef'] < 0, 'Decrease','Increase')
condition = coefs_df['Feature'].isin(['OverTime_Yes', 'Compensation2', 'Compensation1', 'StockOptionLevel', 'Sentiment2',
'Sentiment3', 'Sentiment1', 'BusinessTravel_Travel_Frequently', 'DistanceFromHome',
'TrainingTimesLastYear', 'BusinessTravel_Travel_Rarely', 'PercentSalaryHike'])
coefs_df.loc[np.invert(condition), 'influence'] = 'cannot influence'
coefs_df.loc[condition, 'influence'] = 'can influence'
fig, (ax1, ax2) = plt.subplots(1,2)
plt.rcParams['figure.figsize'] = [12, 8]
sns.barplot(data = coefs_df[coefs_df['influence'] == 'can influence'],
x = 'Coef (abs)',
y = 'Feature',
hue = 'Attrition Odds Change',
dodge = False,
palette = 'RdBu',
ax = ax1)
sns.barplot(data = coefs_df[coefs_df['influence'] == 'cannot influence'],
x = 'Coef (abs)',
y = 'Feature',
hue = 'Attrition Odds Change',
dodge = False,
palette = 'RdBu',
ax = ax2)
fig.suptitle('Feature relationships to attrition odds by whether they can be influenced')
fig.tight_layout(rect=[0, 0.03, 1, 0.95])
ax1.title.set_text('can influence')
ax2.title.set_text('cannot influence')
plt.show()
In the end, our attrition model's goal is to help us identify risks and suggest ways to mitigate them.
The below is an example of how one might use the model predictions and model coefficients to identify individuals to retain and then suggest possible retention intervention strategies. The interactive plot below uses the training and test data, but a similar tool could be built around current employees, using the model we just trained and tested on past data to get leave probabilities for currently active employees.
Below I've stratified employees along dimensions that bucket the hardest-to-replace employees together (higher performance, higher job level, business-critical job roles) and the easier-to-replace employees together (lower performance, lower job level, less critical job roles). To use the tool, identify the group of employees you want to look at and then see which employees have high leave probabilities. Hovering over a single employee shows in the tooltip their specific characteristics in terms of things we have some amount of control over (overtime, pay, stock options, travel requirements, commute distance), which we established in our logistic regression model have independent relationships to leave probability.
Once we know which hard-to-replace employees also have higher leave probabilities than we are comfortable with, we can determine which retention levers to pull. For example, if overtime = yes, a great place to start is looking into curbing or capping their hours worked. If their salary is low relative to peers, we can look at an increase to slightly above peers. Stock options, limiting travel, and offering work-from-home days could also be applicable.
It should be noted that our model is observational, not necessarily causal. That is, attrition risk is related to overtime work, but we don't know for certain that changing overtime work will reduce attrition. None of the observations in our dataset (to our knowledge) contained values that were altered in order to reduce attrition; they all came about in 'natural' ways, without interventions like the ones we are suggesting. We can plug new numbers into our model and see the predicted attrition risk drop, but we will only know how causal the relationship is with an actual experiment.
Consequently, once we've identified a retention opportunity and a retention strategy, we should keep track of any interventions carried out in order to test their effectiveness. In a large organization it would be helpful to be strategic about pulling single levers vs. combinations of levers, and about the sample size of each intervention type, so that we can determine which levers, or combinations of levers, work with a reasonable amount of certainty.
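As a toy illustration of the 'plug in new numbers' idea (synthetic data and a made-up overtime flag, not the fitted model above): score an employee, flip one lever, and re-score. The drop in predicted probability is the model's associational answer only; an experiment is still needed to establish the causal effect.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(7)
# synthetic data: column 0 is an overtime flag that raises the leave rate
overtime = rng.integers(0, 2, size=500)
other = rng.normal(size=500)
X = np.column_stack([overtime, other])
y = (rng.random(500) < 0.1 + 0.3 * overtime).astype(int)

model = LogisticRegression().fit(X, y)

employee = np.array([[1, 0.0]])       # currently works overtime
intervention = np.array([[0, 0.0]])   # same employee, overtime removed

p_before = model.predict_proba(employee)[0, 1]
p_after = model.predict_proba(intervention)[0, 1]
# the model predicts lower risk after the change, but that is a prediction,
# not proof that capping hours causes the reduction
```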
import plotly.io as pio
pio.renderers.default = 'notebook'
# get predicted attrition probabilities
y_pred_proba = pd.concat(
    [pd.DataFrame(logit_best.predict_proba(X_test_scaled)),
     pd.DataFrame(logit_best.predict_proba(X_train_scaled))],
    ignore_index = True   # reset row labels so the two pieces don't share indices
)
# get original columns
X = df.drop('Attrition', axis = 1).values
y = df['Attrition'].values
X_train_orig, X_test_orig, y_train_orig, y_test_orig = train_test_split(X, y, random_state = 7, test_size = .3, stratify = y)
# set up new dataframe with original columns and add predicted quit probabilities
pred_df = pd.concat(
    [pd.DataFrame(X_test_orig),
     pd.DataFrame(X_train_orig)],
    ignore_index = True   # same test-then-train order as the probabilities above
)
pred_df = pd.concat([pd.DataFrame(y_pred_proba).reset_index(drop = True), pred_df], axis = 1)
pred_df.columns = ['Stay Probability', 'Leave Probability'] + list(df.drop('Attrition', axis = 1).columns)
pred_df['Prediction'] = np.where(pred_df['Leave Probability'] >= .5, 'Leave', 'Stay')
pred_df['YearlySalary'] = pred_df['MonthlyIncome'] * 12
# visualize probleave against hard to replace characteristics
import plotly.express as px
fig = px.scatter(pred_df,
x='Leave Probability',
y="JobRole",
color='Leave Probability',
color_continuous_scale='Bluered',
facet_col="PerformanceRating",
facet_row='JobLevel',
width=1000, height=1400,
category_orders={"JobLevel": [5,4,3,2,1]},
title = 'Identifying hard-to-replace employees with high leave probability',
hover_data = {'PerformanceRating':False, 'JobRole':False, 'JobLevel':False,
'EmployeeNumber':True, 'Leave Probability':':.3f', 'Prediction':True, 'OverTime':True,
'YearlySalary':True,
'StockOptionLevel':True, 'BusinessTravel':True,'DistanceFromHome':True})
fig.add_vline(x = .5, annotation_text="prediction boundary", annotation_position="bottom right" ,
line_dash = 'dot')
fig.update_layout(coloraxis_showscale=False)
fig.show()